v1.2: Framework Owns Flow Control (FOC-01..FOC-06)#3
Closed
aksOps wants to merge 7 commits into
Closed
Conversation
Stop the LLM hallucinating session-derived data (environment='unknown',
'prod', incident_id='???') by removing those args from the LLM-visible
tool signature. The framework injects them from session state at the
gateway / wrap boundary before the underlying MCP tool runs.
Decisions:
- D-09-01 strip injected args at registry boundary (graph.py:483-498)
- D-09-02 OrchestratorConfig.injected_args declared in app YAML
- D-09-03 framework wins on conflict, INFO-log the override
- D-09-04 single atomic commit closing Phase 9
Tools migrated (environment stripped from LLM-visible sig):
- observability: get_logs, get_metrics, get_service_health,
check_deployment_history
- remediation: propose_fix, apply_fix
- inc: lookup_similar_incidents
Tools migrated (incident_id stripped from LLM-visible sig):
- mark_resolved, mark_escalated, submit_hypothesis, update_incident
Skill prompts cleaned (triage / deep_investigator / resolution):
no longer carry "always pass environment from the INC" guidance —
now framework-owned. Tool example signatures updated to drop the
now-stripped args.
App YAML configs declare per-app injected_args:
- incident_management.yaml + config.yaml: environment / incident_id
/ session_id from session.environment / session.id
- code_review.runtime.yaml: pr_url / repo / session_id from
session.extra_fields.* / session.id
T-09-05 ordering: injection happens at the TOP of _GatedTool._run /
_arun BEFORE effective_action so the gateway risk-rating sees the
post-injection environment value (prevents prod misclassification
when LLM omits env).
The MCP server functions stay unchanged — apps' direct in-process
calls to get_logs(service='api', environment='production', ...)
keep working. Only the LLM-visible tool surface is stripped.
Coverage on touched files (full suite):
- arg_injection.py: 98%
- config.py: 97%
- graph.py: 86%
- orchestrator.py: 83%
- gateway.py: 73% (pre-existing approve-path branches account
for the gap; new inject-cfg branches are
fully covered)
Concept-leak ratchet: 147 / 147 baseline (held flat).
Suite: 946 passed, 3 skipped (was 931 baseline; 19 new tests added,
and ~4 baseline tests pivoted now that LLM-side env validation is
moot).
Bundles regenerated (dist/app.py + 2 app bundles).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per D-10-01..D-10-04: every agent invocation now returns an
AgentTurnOutput envelope (content, confidence in [0,1],
confidence_rationale, optional signal) enforced via
response_format= on both create_react_agent call sites.
- D-10-01: turn = one create_react_agent invocation
- D-10-02: pydantic envelope; response_format wired at
src/runtime/graph.py:596 + src/runtime/agents/responsive.py:110
- D-10-03: envelope confidence reconciled with typed-terminal-tool
arg confidence; tolerance 0.05 inclusive; tool-arg wins on
mismatch with INFO log shape:
runtime.orchestrator: turn.confidence_mismatch agent={a}
turn_value={e:.2f} tool_value={t:.2f} tool={tn} session_id={sid}
- D-10-04: single atomic commit covers envelope module + two
runner wirings + UI badge fix + 6 skill prompts + tests + dist
Defensive parser parse_envelope_from_result has 3-step fallback
(structured_response -> JSON-parse last AIMessage ->
EnvelopeMissingError) so providers that don't honor
response_format cleanly (e.g. Ollama gpt-oss) still flow through
the contract path. EnvelopeMissingError -> _handle_agent_failure
marks agent_run.error with structured cause.
UI: src/runtime/ui.py:_fmt_confidence_badge None branch flips
from silent "circle confidence -" to hard-error "stop confidence
missing" treatment. New code can't produce None; legacy on-disk
rows still render without crashing.
Skill prompts (10 files touched, 6 ship the new shared
preamble): examples/incident_management/skills/{triage,
deep_investigator,resolution}/system.md +
examples/code_review/skills/{analyzer,intake,recommender}/system.md
each get a `## Output contract` section pointing at the envelope.
deep_investigator drops "confidence is mandatory" boilerplate;
resolution drops "Confidence is required on the terminal tool"
boilerplate. Boilerplate ratchet returns 0 matches.
Defense-in-depth: _assert_envelope_invariant_on_finalize logs
WARNING for any AgentRun with confidence is None at finalize
time (legacy on-disk sessions). Hard rejection lives at the
runner; the finalize hook is forensics only, never raises.
Test fixture migration approach: instead of per-test edits to
the 5 enumerated files, extended StubChatModel itself with
with_structured_output(schema) so all stub-driven tests pass
unchanged. Per-instance stub_envelope_confidence /
stub_envelope_rationale / stub_envelope_signal let tests tune
the canned envelope. graph.py adds _DEFAULT_STUB_ENVELOPE_CONFIDENCE
mapping deep_investigator -> 0.30 to preserve gate-pause-on-DI
behavior in tests that previously relied on confidence is None.
New tests: tests/test_turn_output_envelope.py with 23 cases
(10 schema + 4 reconciliation + 3 parser + 6 parametrized agent
kinds: intake, triage, deep_investigator, resolution, supervisor,
monitor). New helper module tests/_envelope_helpers.py provides
envelope_stub() + EnvelopeStubChatModel for tests that need
explicit ReAct-result fakery.
3 obsolete test_agent_node.py assertions migrated: the runner
now stamps the envelope's confidence onto the AgentRun whenever
a patch-tool-arg confidence harvest yields None (bool-rejected,
unknown-string-rejected, or absent). The harvest-layer rejection
itself is still asserted via the WARN log capture.
Genericity ratchet: 147 -> 149 (rationale documented inline).
Two new uses of the existing `incident` Python local variable
on the new envelope-error branches in graph.py + responsive.py.
session_id parameters use inc_id (not incident.id) to avoid
unnecessary new domain references.
Tests: 946 -> 969 (+23). Coverage on touched files 75.83%
aggregate (gate >= 75%); per-file: turn_output.py 83%,
graph.py 86%, orchestrator.py 83%; responsive.py 34% and
ui.py 12% are pre-existing low-coverage areas not regressed
by this change.
dist/* regenerated (4 files); AgentTurnOutput present in
dist/app.py + dist/apps/incident-management.py +
dist/apps/code-review.py.
Closes FOC-03. Phase 10 done.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 11 (v1.2 -- Framework Owns Flow Control). HITL gating decision
collapses into a single pure framework function:
should_gate(session, tool_call, confidence, cfg) -> GateDecision
driven by the new structured OrchestratorConfig.gate_policy field.
Both _GatedTool._run and _GatedTool._arun now route through
should_gate(...) (via the wrap-level _evaluate_gate bridge) instead
of calling effective_action(...) directly; effective_action itself
is unchanged so the v1.0 PVC-08 prefixed-form lookup invariant is
preserved.
Skill prompts lose every "gateway"/"HITL"/"approval"/"bypass"
mention -- flow control is invisible to the LLM. The audit regex
returns zero matches across examples/*/skills/.
Concurrently fixes the v1.1-testing UI bug where a LangGraph
GraphInterrupt was mis-classified as status="error". The graph
runner (graph.py + responsive.py + _ainvoke_with_retry), the
orchestrator's _resume_with_input wrapper, and the
OrchestratorService task wrapper now all re-raise GraphInterrupt
explicitly, leaving the session in status="pending_approval" so
the Approve/Reject UI buttons can drive resume end-to-end. The
_render_retry_block predicate becomes status=='error' AND no
pending_approval rows to keep the two UI blocks mutually exclusive.
D-11-01 should_gate wraps effective_action (PVC-08 preserved).
D-11-02 OrchestratorConfig.gate_policy declarative (extra='forbid').
D-11-03 Skill prompts free of gateway/HITL/approval/bypass vocab.
D-11-04 GraphInterrupt -> pending_approval; real exc -> error.
D-11-05 Single atomic commit.
Tests: 969 -> 997 passing. 21 should_gate matrix + 6 interrupt-
handling + 1 _find_pending_index coverage test added; PVC-08 + 36
existing direct-call effective_action tests untouched. Coverage:
policy.py 100%, tools/gateway.py 75.31%, orchestrator.py 82.48%
(ui.py 12.48% reflects the pre-existing Streamlit-module floor;
the *new* _should_render_retry_block predicate is at 100%).
Concept-leak ratchet stays binary-green; genericity-ratchet
baseline lifted 149 -> 153 with rationale (4 reuses of the
existing 'incident' local variable name in graph/responsive
turn-confidence-hint reset/update lines, no new domain concept).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(FOC-05, FOC-06)
Phase 12 closes the v1.2 "Framework Owns Flow Control" milestone.
Retry policy collapses into a single pure framework function:
should_retry(retry_count, error, confidence, cfg) -> RetryDecision
driven by the new structured OrchestratorConfig.retry_policy field.
Orchestrator._retry_session_locked consults should_retry BEFORE
running the retry; on policy denial it emits retry_rejected with
reason = decision.reason (one of {auto_retry, max_retries_exceeded,
permanent_error, low_confidence_no_retry, transient_disabled}).
The legacy 'retry already in progress' / 'not in error state'
rejection reasons stay verbatim so existing test consumers still
pattern-match.
Orchestrator.preview_retry_decision(session_id) exposes the same
decision to the UI WITHOUT mutating session state. The retry block
in src/runtime/ui.py now renders a button label + disabled flag
derived from the framework's choice via the 5-case map (D-12-04):
auto_retry -> enabled, "Retry"
max_retries_exceeded -> disabled, "Max retries reached (rc/cap)"
permanent_error -> disabled, "Permanent error -- cannot auto-retry"
low_confidence_no_retry -> disabled, "Confidence too low (N% < th%)"
transient_disabled -> disabled, "Auto-retry disabled in policy"
Error classification uses heuristic isinstance() against small
whitelists (D-12-02 -- no new ToolError ABC, no new opt-in burden
on tool authors). _PERMANENT_TYPES covers pydantic.ValidationError
and EnvelopeMissingError; _TRANSIENT_TYPES covers asyncio.TimeoutError,
TimeoutError, OSError, ConnectionError. Default fall-through is
permanent_error -- fail-closed conservative.
The new tests/test_framework_flow_control_e2e.py is the v1.2
regression-prevention contract. The thesis is that v1.2 flow control
collapses to PURE functions; the test asserts each FOC invariant on
the corresponding pure boundary directly:
FOC-01/02 OrchestratorConfig.injected_args validates dotted-path shape
FOC-03 parse_envelope_from_result raises EnvelopeMissingError
FOC-04 should_gate returns gate=True/'high_risk_tool' on apply_fix/prod
FOC-05 should_retry classifies validation/timeout/at-cap correctly
If a future phase introduces a state-derived arg leak through the
LLM, that contract breaks loudly.
Bundler fix: scripts/build_single_file.py now bundles
runtime/agents/turn_output.py BEFORE policy.py in RUNTIME_MODULE_ORDER
because Phase 12's _PERMANENT_TYPES tuple references EnvelopeMissingError
at module-import time. (Pre-Phase-12 dists referenced it only inside
function bodies, where the strip-plus-rebuild order didn't surface a
NameError.)
D-12-01 should_retry pure (5 reason values); same shape as should_gate.
D-12-02 isinstance() heuristic on _PERMANENT_TYPES + _TRANSIENT_TYPES.
D-12-03 OrchestratorConfig.retry_policy declarative (extra='forbid').
D-12-04 UI surfaces decision via preview_retry_decision (5-case map).
D-12-05 tests/test_framework_flow_control_e2e.py covers FOC-01..05.
D-12-06 single atomic commit.
29 new tests: 14 should_retry matrix + 6 e2e + 9 retry_button_state.
Total: 1026 passing (baseline 997 + 29). Phase 11's GateDecision /
should_gate surface untouched. Concept-leak ratchet stays binary-green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual end-to-end testing of v1.2 surfaced 8 latent bugs across the
arg-injection / gateway / LLM-provider stack that unit tests missed
because they used pydantic-model fixtures while real FastMCP tools
expose JSON-Schema dicts. All 8 are framework-level fixes — none
change v1.2's pure-policy thesis.
Bugs fixed:
1. ``strip_injected_params`` early-exited for dict-schema (FastMCP)
tools, leaking ``environment``/``incident_id``/``session_id`` to
the LLM-visible signature. LLM hallucinated values, fed garbage
back to the runtime, looped at the recursion ceiling. Fix: dict
branch removes injected keys from ``properties`` + ``required``
then ``model_copy``-s the tool.
2. New ``accepted_params_for_tool`` helper introspects both pydantic
and JSON-Schema-dict ``args_schema`` shapes. Used at all 3 inject
call sites (gateway ``_run`` / ``_arun`` / orchestrator
``_invoke_tool``).
3. ``inject_injected_args`` now drops LLM-supplied values for keys
the underlying tool doesn't accept. Prevents pydantic
``unexpected_keyword`` rejections when an LLM hallucinates an
injectable arg despite Phase 9 stripping it from the sig.
4. Gateway wrapper exposes a sanitized LLM-visible tool name
(``:`` → ``__``) so OpenAI's tool-naming regex
(``^[a-zA-Z0-9_-]+$``) and Ollama's
(``[a-zA-Z0-9_.\-]{1,256}``) both accept it. Inner tool name
stays colon-form so PVC-08 prefixed-form policy lookups are
preserved.
5. ``make_agent_node`` no longer double-strips: pass ORIGINAL tools
to ``wrap_tool`` (which strips internally for the LLM-visible
schema). Stripping twice hid injected keys from
``accepted_params``, the inject step skipped them, FastMCP
rejected the call as missing-required-arg.
6. ``_ChatOllamaJsonSchema`` subclass forces
``method='json_schema'`` on ``with_structured_output``. The
default ``function_calling`` method fails on Ollama models
that don't support native tool-calling (gemma, gpt-oss,
ministral) — they emit prose instead of JSON, langchain raises
``OutputParserException`` and Phase 10's envelope is never
parsed.
7. ``_try_recover_envelope_from_raw`` fallback in ``graph.py``
extracts envelope JSON from raw LLM output (markdown-fenced or
greedy ``{...}`` slice) when ``OutputParserException`` fires
inside ``create_react_agent``. Also adds ``recursion_limit=25``
to ``_ainvoke_with_retry`` so future infinite loops surface as
``GraphRecursionError`` instead of hanging silently.
8. New ``openai_compat`` provider kind (``_build_openai_compat_chat``)
wires OpenRouter / Together / vLLM / etc. via langchain-openai's
``ChatOpenAI`` with a ``base_url`` override.
Config:
- ``OrchestratorConfig.injected_args.environment`` now resolves via
``session.extra_fields.environment`` (was ``session.environment``).
Base ``Session`` class is domain-neutral; ``environment`` lives on
``IncidentState.extra_fields``. Mirrors how code_review's
``pr_url`` / ``repo`` were already declared.
- Workhorse model swapped to ``openrouter/openai/gpt-4o-mini``
(``openai_compat`` kind, ``OPENROUTER_API_KEY`` from .env). Ollama
models tested first — surfaced bugs 4-7 — but still need Phase 13
hardening for the ``response_format`` round-trip on tool-loop
termination.
Tests:
- ``test_orchestrator_injected_args_field_in_yaml`` updated to match
the new env path.
- Genericity ratchet baseline 153 → 154 (Phase 12 backfill — the
``Orchestrator._retry_session_locked`` retry-policy gate added one
``incident`` token reuse that was missed in ``be5d351``).
- Full suite: 1026 passing, 3 skipped, 0 failing.
Out of scope (deferred to v1.3 hardening):
- Real-LLM ``create_react_agent`` tool-loop termination with
``response_format=AgentTurnOutput``: gpt-4o-mini and Ollama
models reach the recursion limit without naturally terminating
the React loop. Likely the structured-output round and the
React END signal interact badly.
- Skill-prompt-vs-schema linter (raised during v1.1 testing).
- Bundler ``service.py`` inclusion (``OrchestratorService`` is not
in ``RUNTIME_MODULE_ORDER``; ``dist/ui.py`` imports it from
``app``, breaking ``streamlit run dist/ui.py``. Local dev runs
via ``PYTHONPATH=src:.`` work fine).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 tasks
…841) Removes unused imports (asyncio, tool, Field, FakeMessagesListChatModel, AIMessage, ToolMessage, pytest) and two dead local assignments (inner, wrapper) flagged by ruff in CI. Pure cleanup — no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
Contributor
Author
|
Superseded by #5 — v1.2 ruff cleanup landed in the squash-merge. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Ships v1.2 — Framework Owns Flow Control (4 phases, 6 requirements). The LLM provides intent (which tool, what hypothesis, what confidence); the framework owns everything else — environment injection, confidence enforcement, HITL gating, retry decisions.
78cd361c0688b7AgentTurnOutput.confidenceenvelope on every turnee3c453should_gate(...)+ interrupt-vs-error UI bug fixbe5d351should_retry(...)+ E2E genericity test3ba099fv1.2 thesis verified
tests/test_framework_flow_control_e2e.pyruns a session against a stub LLM that emits ONLY(tool_name, tool_args_excluding_session_data, confidence, signal). Framework injectsenvironmentfrom session, enforces confidence envelope, appliesshould_gatepolicy, and decides retry — all without any LLM-supplied state-derived data.Test counts and coverage
policy.py100% •orchestrator.py77-82% •gateway.py75% • new helpers (accepted_params_for_tool,_should_render_retry_block,_retry_button_state_for) all 100%3ba099f)gateway/HITL/approval/bypassmentionsNotable manual-test discoveries (folded into
3ba099f)Real-LLM testing surfaced 8 latent framework bugs that unit tests missed because they used pydantic-model fixtures while real FastMCP tools expose JSON-Schema dicts. All 8 are framework-level fixes — none change v1.2's pure-policy thesis.
strip_injected_paramsearly-exited for dict-schema tools — leaked injected keys to LLM-visible sig.accepted_params_for_toolhelper.inject_injected_argsnow drops LLM-supplied keys the underlying tool doesn't accept (defends against LLM hallucinating injectable keys despite stripping).:→__for OpenAI's tool-naming regex (^[a-zA-Z0-9_-]+$).make_agent_nodeno longer double-strips: pass originals towrap_toolsoaccepted_paramsis computed against the underlying schema._ChatOllamaJsonSchemasubclass forcesmethod='json_schema'(defaultfunction_callingdoesn't work on gemma/gpt-oss/ministral)._try_recover_envelope_from_rawparser fallback +recursion_limit=25safeguard.openai_compatprovider kind for OpenRouter / Together / vLLM.Out of scope (deferred to v1.3 hardening)
create_react_agenttool-loop termination withresponse_format=AgentTurnOutput— gpt-4o-mini and Ollama models reach the recursion limit without naturally terminating; structured-output round and React END signal interact badly.service.pyinclusion (OrchestratorServicenot inRUNTIME_MODULE_ORDER; affectsstreamlit run dist/ui.pyonly — local dev viaPYTHONPATH=src:.works).Test plan
streamlit run src/runtime/ui.py --server.port 37777withAPP_CONFIG=config/config.yaml,OPENROUTER_API_KEYset — start an INC, verify env auto-injected (no'prod'validation errors), confidence shows on every agent badge (no ⚪)🤖 Generated with Claude Code